JOPSS - Search Results

Search Results: Records 1-5 displayed on this page of 5

Journal Articles

Performance portability with C++ parallel algorithm

Asahi, Yuichi; Padioleau, T.*; Latu, G.*; Bigot, J.*; Grandgirard, V.*; Obrejan, K.*

Dai-36-Kai Suchi Ryutai Rikigaku Shimpojiumu Koen Rombunshu (Internet), 8 Pages, 2022/12

We implement a kinetic plasma simulation code with multiple performance-portable frameworks and evaluate its performance on Intel Icelake, NVIDIA V100 and A100 GPUs, and an AMD MI100 GPU. Relying on the language-standard parallelism stdpar and the proposed language-standard multi-dimensional array support mdspan, we demonstrate a performance-portable implementation without harming readability or productivity. With stdpar, we obtain good overall performance for a kinetic plasma mini-application, within 20% of the Kokkos version on Icelake, V100, A100 and MI100. We conclude that stdpar can be a good candidate for developing performance-portable and productive code targeting Exascale-era platforms, assuming this programming model becomes available on AMD and/or Intel GPUs in the future.
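As a rough illustration of the stdpar style this abstract refers to, the following minimal sketch integrates a distribution function f(x, v) over velocity using a standard C++17 parallel algorithm. It is not taken from the paper: the function name `density`, the row-major layout `f[ix * nv + iv]`, and the grid sizes are assumptions made for illustration, and manual index arithmetic is used in place of mdspan. Some toolchains need a parallel backend (for example TBB) to be linked for `<execution>`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <execution>
#include <numeric>
#include <vector>

// Hypothetical helper (not from the paper): integrate f(ix, iv) over the
// velocity index to obtain a density rho(ix).
// Assumed layout: row-major, f[ix * nv + iv].
std::vector<double> density(const std::vector<double>& f,
                            std::size_t nx, std::size_t nv, double dv) {
  assert(f.size() == nx * nv);

  std::vector<double> rho(nx, 0.0);

  // Index range 0..nx-1 that the parallel algorithm iterates over.
  std::vector<std::size_t> ix_range(nx);
  std::iota(ix_range.begin(), ix_range.end(), std::size_t{0});

  // Capture raw pointers by value so the lambda does not depend on
  // host-side container objects.
  const double* fp = f.data();
  double* rp = rho.data();

  // Each ix writes only rho[ix], so iterations are independent and the
  // par_unseq policy is safe here.
  std::for_each(std::execution::par_unseq,
                ix_range.begin(), ix_range.end(),
                [=](std::size_t ix) {
                  double sum = 0.0;
                  for (std::size_t iv = 0; iv < nv; ++iv) {
                    sum += fp[ix * nv + iv];
                  }
                  rp[ix] = sum * dv;
                });
  return rho;
}
```

Calling `density(f, 4, 8, 0.5)` on a vector of 32 ones returns four entries, each the sum of eight ones scaled by `dv`. The same source can in principle be compiled for CPUs or for GPUs by a compiler that offloads standard parallel algorithms, which is the portability argument the abstract makes.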

Journal Articles

Performance portable Vlasov code with C++ parallel algorithm

Asahi, Yuichi; Padioleau, T.*; Latu, G.*; Bigot, J.*; Grandgirard, V.*; Obrejan, K.*

Proceedings of 2022 International Workshop on Performance, Portability, and Productivity in HPC (P3HPC) (Internet), p.68 - 80, 2022/11

https://doi.org/10.1109/P3HPC56579.2022.00012

Times Cited Count: 1, Percentile: 0 (Computer Science, Theory & Methods)

This paper presents a performance-portable implementation of a kinetic plasma simulation code that uses C++ parallel algorithms to run across multiple CPUs and GPUs. Relying on the language-standard parallelism stdpar and the proposed language-standard multi-dimensional array support mdspan, we demonstrate that a performance-portable implementation is possible without harming readability or productivity. We obtain good overall performance for a mini-application, within 20% of the Kokkos version on Intel Icelake, NVIDIA V100, and A100 GPUs. Our conclusion is that stdpar can be a good candidate for developing performance-portable and productive code targeting Exascale-era platforms, assuming this approach becomes available on AMD and/or Intel GPUs in the future.

Journal Articles

An Estimation of complexity and computational costs for vertical block-cyclic distributed parallel LU factorization

Imamura, Toshiyuki

Journal of Supercomputing, 15(1), p.95 - 110, 2000/00

https://doi.org/10.1023/A:1008121726802

Times Cited Count: 2, Percentile: 30.01 (Computer Science, Hardware & Architecture)

no abstracts in English

JAEA Reports

Speedup of MCACE, a Monte Carlo code for evaluation of shielding safety, by parallel computer,(1); Paralleling of MCACE code and performance evaluation by simulator

*; ; Masukawa, Fumihiro; Naito, Yoshitaka; Minami, Kazuyoshi*

JAERI-M 91-066, 77 Pages, 1991/04

JAERI-M-91-066.pdf:1.82MB

no abstracts in English

Oral presentation

Performance portability of Ensemble Kalman Filter using C++ senders/receivers

Asahi, Yuichi; Hasegawa, Yuta; Padioleau, T.*; Millan, A.*; Bigot, J.*; Grandgirard, V.*; Obrejan, K.*

no journal

Generally, production-ready scientific simulations consist of many different tasks, including computations, communications and file I/O. Compared to computations accelerated on GPUs, communications and file I/O are slower and can become major bottlenecks. It is thus important to run these tasks concurrently to hide their costs. In the present talk, we employ the proposed language-standard C++ senders/receivers to hide the costs of communications and file I/O. As a case study, we implement a 2D turbulence simulation code with the local ensemble transform Kalman filter (LETKF) using C++ senders/receivers. In LETKF, the mock observation data are read from files, followed by MPI communications and dense matrix operations on GPUs. We demonstrate a performance-portable implementation with this framework that also exploits the performance gain from the introduced concurrency.
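The idea of hiding I/O cost behind computation can be sketched with standard C++ facilities. The example below deliberately uses `std::async` and `std::future` as a stand-in, not senders/receivers, since that proposal is not assumed to be available in the reader's standard library. All names (`load_observation`, `analyse`, `run_pipeline`) are hypothetical, and the "file read" and "analysis" are placeholders rather than LETKF.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for reading one step's observation data from a file.
std::vector<double> load_observation(int step, std::size_t n) {
  return std::vector<double>(n, static_cast<double>(step));
}

// Hypothetical stand-in for the computation applied to one step's data.
double analyse(const std::vector<double>& obs) {
  return std::accumulate(obs.begin(), obs.end(), 0.0);
}

// Process nsteps steps, loading the data for step + 1 on another thread
// while the data for the current step is being analysed.
std::vector<double> run_pipeline(int nsteps, std::size_t n) {
  assert(nsteps >= 0);
  std::vector<double> results;
  if (nsteps == 0) return results;

  // Start loading the first step before entering the loop.
  std::future<std::vector<double>> next =
      std::async(std::launch::async, load_observation, 0, n);

  for (int step = 0; step < nsteps; ++step) {
    // Wait for the data of the current step.
    std::vector<double> obs = next.get();

    // Launch the load for the following step so it overlaps with analyse().
    if (step + 1 < nsteps) {
      next = std::async(std::launch::async, load_observation, step + 1, n);
    }

    results.push_back(analyse(obs));
  }
  return results;
}
```

With senders/receivers the same dependency structure would be expressed as a composed task graph instead of explicit futures; the sketch only shows the overlap pattern the abstract describes.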